Astronomy & Astrophysics manuscript no. 12823 


© ESO 2009 


December 10, 2009 





Filaments in observed and mock galaxy catalogues 



R. S. Stoica 1, V. J. Martínez 2, and E. Saar 3





1 Université Lille 1, Laboratoire Paul Painlevé, 59655 Villeneuve d'Ascq Cedex, France
e-mail: radu.stoica@math.univ-lille1.fr

2 Observatori Astronòmic and Departament d'Astronomia i Astrofísica, Universitat de València, Apartat de correus 22085, E-46071 València, Spain
e-mail: martinez@uv.es

3 Tartu Observatoorium, Tõravere, 61602 Estonia
e-mail: saar@aai.ee



Received / Accepted 



ABSTRACT 



Context. The main feature of the spatial large-scale galaxy distribution is an intricate network of galaxy filaments. Although many 
attempts have been made to quantify this network, there is no unique and satisfactory recipe for that yet. 

Aims. The present paper compares the filaments in the real data and in the numerical models, to see if our best models reproduce 
statistically the filamentary network of galaxies. 

Methods. We apply an object point process with interactions (the Bisous process) to trace and describe the filamentary network both 
in the observed samples (the 2dFGRS catalogue) and in the numerical models that have been prepared to mimic the data. We compare 
the networks. 

Results. We find that the properties of filaments in numerical models (mock samples) have a large variance. A few mock samples 
display filaments that resemble the observed filaments, but usually the model filaments are much shorter and do not form an extended 
network. 

Conclusions. We conclude that although we can build numerical models that are similar to observations in many respects, they may
still fail to explain the filamentary structure seen in the data. The Bisous-built filaments are a good test for such a structure.

Key words. Cosmology: large-scale structure of Universe — Methods: data analysis — Methods: statistical 



1. Introduction 

The large-scale structure of the Universe traced by the three-dimensional
distribution of galaxies shows intriguing patterns: filamentary structures
connecting huge clusters surround nearly empty regions, the so-called voids.
As an example, we show here a map from the 2dF Galaxy Redshift Survey
(2dFGRS, Colless et al. 2001). As an illustration of the filamentary network,
Fig. 1 shows the positions of galaxies in two 2.6° thick slices from two
spatial regions that the 2dFGRS covered. Distances are given in redshifts z.

Filaments visually dominate the galaxy maps. Real three-dimensional filaments
have been extracted from the galaxy distribution as a result of special
observational projects (Pimbblet & Drinkwater 2004), or by searching for
filaments in the 2dFGRS catalogue (Pimbblet et al. 2004). These filaments
have been searched for between galaxy clusters, determining the density
distribution and deciding if it is filamentary, individually for every
filament (Pimbblet 2005). Filaments are also suspected to hide half of the
warm gas in the Universe; an example of a discovery of such gas is the paper
by Werner et al. (2008).

There are still no standard methods to describe the observed filamentary
structure, although much work is being done in this direction. The usual
second-order summary statistics, such as the two-point correlation function
or the power spectrum, do not provide morphological information. Minkowski
functionals, the minimal spanning tree (MST), percolation and shapefinders
have been introduced for this purpose (for a review see Martínez & Saar 2002).



The minimal spanning tree was introduced in cosmology by Barrow et al. (1985).
It is a unique graph that connects all points of the process without closed
loops, but it describes mainly the local nearest-neighbour distribution and
does not give us the global and large-scale properties of the filamentary
network. A recent development of these ideas is presented by Colberg (2007).
He applies a minimal spanning tree on a grid, and works close to the
percolation regime - this allows the study of the global structure of the
galaxy distribution. We note that using a grid introduces a smoothed density,
and this is typical for other recent approaches, too.

In order to describe the filamentary structure of continuous density fields,
a skeleton method has been proposed and developed by Eriksen et al. (2004)
and Novikov et al. (2006). The skeleton is determined by segments parallel to
the gradient of the field, connecting saddle points to local maxima.
Calculating the skeleton involves interpolating and smoothing the point
distribution, which introduces an extra parameter: the bandwidth of the kernel
function (typically a Gaussian) used to estimate the density field from the
point distribution. This is generally the case for most of the density-based
approaches. The skeleton method was first applied to two-dimensional maps, as
an approach to study the cosmic microwave background sky (Eriksen et al. 2004).
The method was adapted for 3-D maps (Sousbie et al. 2008a) and was applied to
the Sloan Digital Sky Survey by Sousbie et al. (2008b), providing, by means of
the length of the skeleton, a good distinguishing tool for the analysis of the
filamentary structures. The formalism has recently been further developed and
applied to study the evolution of filamentary structure in simulations
(Sousbie et al. 2009).

Another approach is that of Aragón-Calvo et al. (2007a) (see also
Aragón-Calvo 2007). They use the Delaunay Tessellation Field Estimator (DTFE)
to reconstruct the density field for the galaxy distribution, and apply the
Multiscale Morphology Filter (MMF) to identify various structures, for
instance clusters, walls, filaments and voids (Aragón-Calvo et al. 2007b). As
a further development, this group has used the watershed algorithm to describe
the global properties of the density field (Aragón-Calvo et al. 2008).

A new direction is to use the second-order properties (the Hessian matrix) of
the density field (Bond et al. 2009) or the deformation tensor
(Forero-Romero et al. 2008). As is shown in these papers, this allows them to
trace and classify different features of the fields.

Our approach does not introduce the density estimation step; we consider the
galaxy distribution as a marked point process. In an earlier paper
(Stoica et al. 2005b), we proposed to use an automated method to trace
filaments for realisations of marked point processes, which has been shown to
work well for the detection of road networks in remote sensing situations
(Lacoste et al. 2005; Stoica et al. 2002, 2004). This method is based on the
Candy model, a marked point process where segments serve as marks. The Candy
model can be applied to 2-D filaments, and we tested it on simulated galaxy
distributions. The filaments we found delineated well the filaments detected
by eye.

Based on our previous experience with the Candy model, we generalised the
approach for three dimensions. As the interactions between the structure
elements are more complex in three dimensions, we had to define a more complex
model, the Bisous model (Stoica et al. 2005a). This model gives a general
framework for the construction of complex patterns made of simple interacting
objects. In our case, it can be seen as a generalisation of the Candy model.
We applied the Bisous model to trace and describe the filaments in the 2dFGRS
(Stoica et al. 2007b) and demonstrated that it works well.



In the paper cited above we chose the observational samples 
from the main magnitude-limited 2dFGRS catalogue, selecting 
the spatial regions to have approximately constant spatial den- 
sities. But a strict application of the Bisous process demands a 
truly constant spatial density (intensity). In this paper, we will 
apply the Bisous process to compare the observational data with 
mock catalogues, specially built to represent the 2dFGRS sur- 
vey. To obtain strict statistical test results, we use here volume- 
limited subsamples of the 2dFGRS and of the mock catalogues. 
We trace the filamentary network in all our catalogues and com- 
pare its properties. 

2. Mathematical tools 

In this section we describe the main tools we use to study the 
large-scale filaments. The key idea is to see this filamentary 
structure as a realisation of a marked point process. Under this 
hypothesis, the cosmic web can be considered as a random con- 
figuration of segments or thin cylinders that interact, forming a 
network of filaments. Hence, the morphological and quantitative 
characteristics of these complex geometrical objects can be ob- 
tained by following a straightforward procedure: constructing a 
model, sampling the probability density describing the model, 
and, finally, applying the methods of statistical inference. 

We have given a more detailed description of these methods in a previous
paper (Stoica et al. 2007b).



2.1. Marked point processes

A popular model for the spatial distribution of galaxies is a point process
on $K$ (a compact subset of $\mathbb{R}^3$, the cosmologist's sample volume),
a random configuration of points $\mathbf{k} = \{k_1, \ldots, k_n\}$ lying in
$K$. Let $\nu(K)$ be the volume of $K$.

We may associate characteristics or marks to the points. For instance, to
each point in a configuration $\mathbf{k}$, shape parameters describing simple
geometrical objects may be attached. Let $(M, \mathcal{M}, \nu_M)$ be the
probability measure space defining these marks. A marked or object point
process on $K \times M$ is the random configuration
$\mathbf{y} = \{(k_1, m_1), (k_2, m_2), \ldots, (k_n, m_n)\}$, with
$y_i = (k_i, m_i) \in K \times M$ for all $i = 1, \ldots, n$, such that the
process of locations is a point process on $K$. For our purposes, the point
process is considered finite and simple, i.e. the number of points in a
configuration is finite and $k_i \neq k_j$ for any $i \neq j$ with
$1 \le i, j \le n$.

In the simplest marked point process, the objects do not interact. The
Poisson object point process is the most appropriate choice for such a
situation. This process chooses a number of objects according to a Poisson law
with the intensity parameter $\nu(K)$, gives each object a random independent
location, uniform in $K$, and a random shape or mark chosen independently
according to $\nu_M$. The Poisson object point process has the great advantage
that it can be described by analytical formulae. Still, it is too simple
whenever the interactions of objects are to be taken into account.
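As an illustration, the non-interacting process described above could be sampled as in the following minimal sketch. It assumes that $K$ is a rectangular box, so that $\nu(K)$ is simply the product of its sides, and it uses a pair of uniform numbers as a placeholder mark distribution. The function names are hypothetical and are not taken from the authors' code.

```python
import math
import random

def sample_poisson_count(mean, rng):
    """Draw a Poisson variate by counting unit-rate exponential arrivals in [0, mean]."""
    count, elapsed = 0, rng.expovariate(1.0)
    while elapsed <= mean:
        count += 1
        elapsed += rng.expovariate(1.0)
    return count

def sample_poisson_objects(box, rng):
    """Sample a Poisson object point process in the box [0,a] x [0,b] x [0,c].

    Returns a list of (centre, mark) pairs. The mark is a placeholder: two
    numbers drawn uniformly on [0, 2*pi) x [0, 1] (an assumption of this sketch).
    """
    a, b, c = box
    n = sample_poisson_count(a * b * c, rng)  # intensity parameter = volume of K
    objects = []
    for _ in range(n):
        centre = (rng.uniform(0, a), rng.uniform(0, b), rng.uniform(0, c))
        mark = (rng.uniform(0, 2 * math.pi), rng.uniform(0, 1))
        objects.append((centre, mark))
    return objects

rng = random.Random(1)
objects = sample_poisson_objects((2.0, 3.0, 1.5), rng)
```

Counting exponential arrivals is slow for large volumes but avoids the numerical underflow of the textbook product-of-uniforms method; for realistic survey volumes a library Poisson sampler would be the natural replacement.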

The solution to the latter problem is to specify a probability density
$p(\mathbf{y})$ that takes into account interactions between the objects. This
probability density is specified with respect to the reference measure given
by the Poisson object point process. There is a lot of freedom in building
such densities, provided that they are integrable with respect to the
reference measure and are locally stable. This second condition requires that
there exists $\Lambda > 0$ such that
$p(\mathbf{y} \cup \{(k, m)\})/p(\mathbf{y}) \le \Lambda$ for any
$(k, m) \in K \times M$. Local stability implies integrability. It is also an
important condition, guaranteeing that the simulation algorithms for sampling
such models have good convergence properties.

For further reading and a comprehensive mathematical presentation of object
point processes, we recommend the monographs by van Lieshout (2000),
Møller & Waagepetersen (2003), Stoyan et al. (1995), and Illian et al. (2008).

2.2. Bisous model 

In this section, we shall describe the probability density of the Bisous
model for the network of cosmic filaments. The Bisous model is a marked point
process that was designed to generate and analyse random spatial patterns
(Stoica et al. 2005a, 2007b).

Random spatial patterns are complex geometrical structures composed of rather
simple objects that interact. We can describe our problem as follows: in a
region $K$ of a finite volume, we observe a finite number of galaxies
$\mathbf{d} = \{d_1, d_2, \ldots, d_n\}$. The positions of these galaxies form
a complicated filamentary network. Modelling it by a network of thin cylinders
that can get connected and aligned in a certain way, a marked point process -
the Bisous model - can be built in order to describe it.

A random cylinder is an object characterised by its centre $k$ and its mark
giving the shape parameters. The shape parameters of a cylinder are the radius
$r$, the height $h$ and the orientation vector $\omega$. We consider the
radius and height parameters as fixed, whereas the orientation vector
parameters $\omega = \phi(\eta, \tau)$ are uniformly distributed on
$M = [0, 2\pi) \times [0, 1]$ so that

$$\omega = \left(\sqrt{1 - \tau^2}\,\cos\eta,\ \sqrt{1 - \tau^2}\,\sin\eta,\ \tau\right). \qquad (1)$$
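The parametrisation of the orientation vector can be checked numerically. The sketch below assumes that Eq. (1) reads $\omega = (\sqrt{1-\tau^2}\cos\eta, \sqrt{1-\tau^2}\sin\eta, \tau)$, which is the form implied by the surviving terms of the equation; the function names are hypothetical.

```python
import math
import random

def orientation(eta, tau):
    """Map (eta, tau) in [0, 2*pi) x [0, 1] to a unit vector, following Eq. (1)."""
    s = math.sqrt(1.0 - tau * tau)
    return (s * math.cos(eta), s * math.sin(eta), tau)

def random_orientation(rng):
    """Draw (eta, tau) uniformly on the mark space and return the orientation vector."""
    return orientation(rng.uniform(0.0, 2.0 * math.pi), rng.uniform(0.0, 1.0))

rng = random.Random(0)
omega = random_orientation(rng)
norm = math.sqrt(sum(c * c for c in omega))  # equals 1 up to rounding
```

Because $\tau$ is restricted to $[0, 1]$, only the upper hemisphere is covered, which is sufficient for an undirected axis; drawing $\tau$ uniformly gives orientations that are uniform on that hemisphere.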



For our purposes, throughout this paper the shape of a cylinder is denoted by
$s(y) = s(k, r, h, \omega)$, which is a compact subset of $\mathbb{R}^3$ of a
finite volume $\nu(s(y))$. The shape of a random cylinder configuration
$\mathbf{y}$ is defined by the random set
$Z(\mathbf{y}) = \bigcup_{y \in \mathbf{y}} s(y)$.

A cylinder $(k, \omega)$ has $q = 2$ rigid extremity points. We centre around
each of these points a sphere of the radius $r_a$. These two spheres form an
attraction region that plays an important role in defining connectivity and
alignment rules for cylinders. We illustrate the basic cylinder in Fig. 2,
where it is centred at the coordinate origin and its symmetry axis is parallel
to $Ox$. The coordinates of the extremity points are

$$\left((-1)^{u+1}\left(\frac{h}{2} + r_a\right),\ 0,\ 0\right), \quad u \in \{1, 2\}, \qquad (2)$$

and the orientation vector is $\omega = (1, 0, 0)$.

Fig. 2. A thin cylinder that generates the filamentary network. (Figure label: attraction region.)

The probability density for a marked point process based on random cylinders
can be written using the Gibbs modelling framework:

$$p(\mathbf{y}|\theta) = \frac{\exp[-U(\mathbf{y}|\theta)]}{\alpha(\theta)}, \qquad (3)$$

where $\alpha(\theta)$ is the normalising constant, $\theta$ is the vector of
the model parameters and $U(\mathbf{y}|\theta)$ is the energy function of the
system.

Modelling the filamentary network induced by the galaxy 
positions needs two assumptions. The first assumption is that 
locally, galaxies may be grouped together inside a rather small 
thin cylinder. The second assumption is that such small cylinders 
may combine to extend a filament if neighbouring cylinders are 
aligned in similar directions. 

Following these two ideas, the energy function in (3) can be specified as

$$U(\mathbf{y}|\theta) = U_d(\mathbf{y}|\theta) + U_i(\mathbf{y}|\theta), \qquad (4)$$

where $U_d(\mathbf{y}|\theta)$ is the data energy and $U_i(\mathbf{y}|\theta)$
is the interaction energy, associated to the first and second assumptions
above, respectively. The data energy determines where the cylinders are
positioned in the galaxy field, and the interaction energy is the main factor
that causes the cylinders to form filamentary patterns.

2.3. Data energy 

The data energy of a configuration of cylinders $\mathbf{y}$ is defined as the
sum of the energy contributions corresponding to each cylinder:

$$U_d(\mathbf{y}|\theta) = -\sum_{y \in \mathbf{y}} v(y), \qquad (5)$$

where $v(\cdot)$ is the potential function associated to a cylinder; it
depends on $\mathbf{d}$ and the model parameters.

The cylinder potential is built taking into account local criteria such as the
density, spread and number of galaxies. To formulate these criteria, an extra
cylinder is attached to each cylinder $y$, with exactly the same parameters as
$y$, except for the radius which equals $2r$. Let $\tilde{s}(y)$ be the shadow
of $s(y)$ obtained by the subtraction of the initial cylinder from the extra
cylinder, as shown in Fig. 3. Then, each cylinder $y$ is divided in three
equal volumes along its main symmetry axis, and we denote by $s_1(y)$,
$s_2(y)$ and $s_3(y)$ their corresponding shapes.

Fig. 3. Two-dimensional projection of a thin cylinder with its shadow within a pattern of galaxies.

The local density condition verifies that the density of galaxies inside
$s(y)$ is higher than the density of galaxies in $\tilde{s}(y)$, and it can be
expressed as follows:

$$n(\mathbf{d} \cap s(y))/\nu(s(y)) > n(\mathbf{d} \cap \tilde{s}(y))/\nu(\tilde{s}(y)), \qquad (6)$$

where $n(\mathbf{d} \cap s(y))$ and $n(\mathbf{d} \cap \tilde{s}(y))$ are the
numbers of galaxies covered by the cylinder and its shadow, and $\nu(s(y))$
and $\nu(\tilde{s}(y))$ are the volumes of the cylinder and its shadow,
respectively.

The even location of the galaxies along the cylinder main axis is ensured by
the spread condition, which is formulated as

$$\prod_{i=1}^{3} n(\mathbf{d} \cap s_i(y)) > 0, \qquad (7)$$

where $n(\mathbf{d} \cap s_i(y))$ is the number of galaxies belonging to
$s_i(y)$.

If both these conditions are fulfilled, then $v(y)$ is given by the difference
between the number of galaxies contained in the cylinder and the number of
galaxies contained in its shadow:

$$v(y) = n(\mathbf{d} \cap s(y)) - n(\mathbf{d} \cap \tilde{s}(y)). \qquad (8)$$

Whenever any of the previous conditions is violated, a positive value
$v_{\max}$ is assigned to the potential of a cylinder.

A segment which does not fulfil the required conditions can still be
integrated into the network through the parameter $v_{\max}$. This should
result in more complete networks and better mixing properties of the method.
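The three ingredients of the cylinder potential (density condition, spread condition, and the count difference) can be put together as in the sketch below. It is only an illustration under stated assumptions: galaxies are plain coordinate triples, the shadow is the hollow cylinder between radii $r$ and $2r$ with the same height, and the function names are hypothetical. The text calls $v_{\max}$ a positive value while the parameter range quoted later for the complete model is $v_{\max} < 0$, so the sketch does not fix the sign and simply returns whatever the caller supplies.

```python
import math

def _axial_and_radial(point, centre, omega):
    """Return (signed distance along the axis, distance from the axis) for a point."""
    rel = [p - c for p, c in zip(point, centre)]
    axial = sum(x * w for x, w in zip(rel, omega))
    radial_sq = sum(x * x for x in rel) - axial * axial
    return axial, math.sqrt(max(radial_sq, 0.0))

def cylinder_potential(galaxies, centre, omega, r, h, v_max):
    """Potential v(y) of one cylinder; omega must be a unit vector."""
    thirds = [0, 0, 0]   # galaxy counts in the three equal slices along the axis
    n_shadow = 0         # galaxies in the shadow: r < distance from axis <= 2r
    for g in galaxies:
        axial, radial = _axial_and_radial(g, centre, omega)
        if abs(axial) > h / 2.0:
            continue
        if radial <= r:
            thirds[min(int((axial + h / 2.0) / (h / 3.0)), 2)] += 1
        elif radial <= 2.0 * r:
            n_shadow += 1
    n_cyl = sum(thirds)
    vol_cyl = math.pi * r * r * h
    vol_shadow = 3.0 * vol_cyl   # pi * (2r)^2 * h minus the inner cylinder
    density_ok = n_cyl / vol_cyl > n_shadow / vol_shadow   # density condition (6)
    spread_ok = all(count > 0 for count in thirds)         # spread condition (7)
    if density_ok and spread_ok:
        return n_cyl - n_shadow                            # count difference (8)
    return v_max                                           # caller-supplied value

# Made-up galaxy positions, for illustration only.
galaxies = [(-0.8, 0.0, 0.0), (0.0, 0.1, 0.0), (0.9, 0.0, 0.1), (0.0, 1.5, 0.0)]
axis = (1.0, 0.0, 0.0)
v_good = cylinder_potential(galaxies, (0.0, 0.0, 0.0), axis, r=1.0, h=3.0, v_max=-5.0)
v_bad = cylinder_potential(galaxies[::2], (0.0, 0.0, 0.0), axis, r=1.0, h=3.0, v_max=-5.0)
```

In the first call three galaxies fall inside the cylinder (one per slice) and one in the shadow, so both conditions hold; in the second call the middle slice is empty and the spread condition fails.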

We note that we have chosen cylinders as the objects here in order to trace
filaments in the galaxy distribution. Such objects are tools at our disposal
and any object can be chosen; as an example, Stoica et al. (2005a) have built
systems of flat elements (walls) and of regular polytopes (galaxy clusters),
based on the Bisous process.

2.4. Interaction energy 

The interaction energy takes into account the interactions between cylinders.
It is the model component ensuring that the cylinders form a filamentary
network, and it is given by

$$U_i(\mathbf{y}|\theta) = -n_\kappa(\mathbf{y}) \log \gamma_\kappa - \sum_{s=0}^{2} n_s(\mathbf{y}) \log \gamma_s, \qquad (9)$$

where $n_\kappa$ is the number of repulsive cylinder pairs and $n_s$ is the
number of cylinders connected to the network through $s$ extremity points. The
variables $\log \gamma_\kappa$ and $\log \gamma_s$ are the potentials
associated to these configurations, respectively.

Fig. 4. Two-dimensional representation of interacting cylinders.

We define the interactions that allow the configuration of cylinders to trace
the filamentary network below. To illustrate these definitions, we show an
example configuration of cylinders (in two dimensions) in Fig. 4.

Two cylinders are considered repulsive if they reject each other and if they
are not orthogonal. We declare that two cylinders $y_1 = (k_1, \omega_1)$ and
$y_2 = (k_2, \omega_2)$ reject each other if their centres are closer than the
cylinder height, $d(k_1, k_2) < h$. Two cylinders are considered to be
orthogonal if $|\omega_1 \cdot \omega_2| \le \tau_\perp$, where $\cdot$ is the
scalar product of the two orientation vectors and $\tau_\perp \in (0, 1)$ is a
predefined parameter. So, we allow a certain range of mutual angles between
cylinders that we consider orthogonal.

Two cylinders are connected if they attract each other, do not reject each
other and are well aligned. Two cylinders attract each other if only one
extremity point of the first cylinder is contained in the attraction region of
the other cylinder. The cylinders are "magnetised" in the sense that they
cannot attract each other through extremity points having the same index. Two
cylinders are well aligned if $\omega_1 \cdot \omega_2 \ge 1 - \tau_\parallel$,
where $\tau_\parallel \in (0, 1)$ is a predefined parameter.
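The pairwise rules above translate almost directly into predicates. The sketch below is one possible reading, not the authors' implementation: it assumes that the extremity points of a cylinder with arbitrary orientation lie at a distance $h/2 + r_a$ from the centre along $\pm\omega$, that "contained in the attraction region" means lying within $r_a$ of an extremity point of the other cylinder, and that the inequalities are non-strict. All function names and parameter values are hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def reject(k1, k2, h):
    """Centres closer than the cylinder height."""
    return math.dist(k1, k2) < h

def orthogonal(w1, w2, tau_perp):
    return abs(dot(w1, w2)) <= tau_perp

def well_aligned(w1, w2, tau_par):
    return dot(w1, w2) >= 1.0 - tau_par

def repulsive(k1, w1, k2, w2, h, tau_perp):
    return reject(k1, k2, h) and not orthogonal(w1, w2, tau_perp)

def extremities(k, w, h, r_a):
    """Extremity points with indices 1 and 2 (assumed placement along +/- w)."""
    d = h / 2.0 + r_a
    return {1: tuple(c + d * x for c, x in zip(k, w)),
            2: tuple(c - d * x for c, x in zip(k, w))}

def attract(k1, w1, k2, w2, h, r_a):
    """Exactly one pair of extremity points with different indices overlaps."""
    e1, e2 = extremities(k1, w1, h, r_a), extremities(k2, w2, h, r_a)
    hits = [(u, v) for u in (1, 2) for v in (1, 2)
            if u != v and math.dist(e1[u], e2[v]) <= r_a]
    return len(hits) == 1

def connected(k1, w1, k2, w2, h, r_a, tau_par):
    return (attract(k1, w1, k2, w2, h, r_a)
            and not reject(k1, k2, h)
            and well_aligned(w1, w2, tau_par))

h, r_a = 2.0, 0.5
origin, x_axis = (0.0, 0.0, 0.0), (1.0, 0.0, 0.0)
in_line = connected(origin, x_axis, (3.0, 0.0, 0.0), x_axis, h, r_a, tau_par=0.1)
flipped = attract(origin, x_axis, (3.0, 0.0, 0.0), (-1.0, 0.0, 0.0), h, r_a)
crowding = repulsive(origin, x_axis, (0.5, 0.0, 0.0), (0.8, 0.6, 0.0), h, tau_perp=0.3)
```

The `flipped` case reverses the orientation of the second cylinder, so the touching extremity points carry the same index and no attraction is registered, which is the "magnetised" behaviour described in the text.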

Consider now Fig. 4. According to the previous definitions, we observe that
the cylinders $c_1$, $c_2$ and $c_3$ are connected. The cylinders $c_1$ and
$c_3$ are connected to the network through one extremity point, while $c_2$ is
connected to the network through both extremity points. The cylinders $c_4$
and $c_5$ are not connected to anything - $c_4$ is not well aligned with
$c_2$, the angle between their directions is too large, and $c_5$ is not
attracted to any other cylinder. It is important to notice that the cylinders
$c_3$ and $c_4$ are not interacting - they are wrongly 'polarised', their
overlapping extremity points have the same index. The cylinder $c_5$ rejects
the cylinders $c_1$ and $c_4$ (the centres of these cylinders are close), but
as it is rather orthogonal both to $c_1$ and $c_4$, it does not repulse them.
The cylinders $c_2$ and $c_4$ reject each other and are not orthogonal, so
they form a repulsive pair.

Altogether, the configuration in Fig. 4 adds to the interaction energy
contributions from three connected cylinders (one doubly-connected, $c_2$, and
two singly-connected, $c_1$ and $c_3$), and from one repulsive cylinder pair
($c_2$-$c_4$).
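For a concrete number, the interaction energy of such a configuration can be evaluated once the counts are known. The sketch below reads Eq. (9) as $U_i = -n_\kappa \log\gamma_\kappa - \sum_{s=0}^{2} n_s \log\gamma_s$ and uses the counts described for Fig. 4, with the two unconnected cylinders entering through $n_0$. The values of $\gamma_\kappa$ and $\gamma_s$ are made up for illustration, and the function name is hypothetical.

```python
import math

def interaction_energy(n_kappa, n_s, gamma_kappa, gamma_s):
    """U_i = -n_kappa * log(gamma_kappa) - sum over s of n_s[s] * log(gamma_s[s])."""
    energy = -sum(n * math.log(g) for n, g in zip(n_s, gamma_s))
    if n_kappa > 0:
        if gamma_kappa == 0.0:
            return math.inf   # hard-core limit: repulsive pairs are forbidden
        energy -= n_kappa * math.log(gamma_kappa)
    return energy

# Counts for the configuration described in the text: two unconnected cylinders
# (s = 0), two singly connected (s = 1), one doubly connected (s = 2),
# and one repulsive pair.
n_s = (2, 2, 1)
u_fig4 = interaction_energy(n_kappa=1, n_s=n_s,
                            gamma_kappa=0.5, gamma_s=(0.5, 1.0, 2.0))
```

With these illustrative parameters, free cylinders and repulsive pairs raise the energy ($\gamma < 1$), singly connected cylinders are neutral ($\gamma = 1$) and doubly connected ones lower it ($\gamma > 1$), so the sampler is pushed towards connected chains.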

The complete model (3), which includes the definitions of the data energy and
of the interaction energy given by (5) and (9), is well defined for parameters
$v_{\max} < 0$, $\gamma_0, \gamma_1, \gamma_2 > 0$ and
$\gamma_\kappa \in [0, 1]$. The definitions of the interactions and the
parameter ranges chosen ensure that the complete model is locally stable and
Markov in the sense of Ripley-Kelly (Stoica et al. 2005a). For cosmologists,
this means that we can safely use this model without expecting any dangers
(numerical, convergence, etc.).

2.5. Simulation

Several Monte Carlo techniques are available to simulate marked point
processes: spatial birth-and-death processes, Metropolis-Hastings algorithms,
reversible jump dynamics or more recent exact simulation techniques
(Geyer & Møller 1994; Geyer 1999; Green 1995; Kendall & Møller 2000;
van Lieshout 2000; van Lieshout & Stoica 2006; Preston 1977).



In this paper, we need to sample from the joint probability density law
$p(\mathbf{y}, \theta)$. This is done by using an iterative Monte Carlo
algorithm. An iteration of the algorithm consists of two steps. First, a value
for the parameter $\theta$ is chosen with respect to $p(\theta)$. Then,
conditionally on $\theta$, a cylinder pattern is sampled from
$p(\mathbf{y}|\theta)$ using a Metropolis-Hastings algorithm
(Geyer & Møller 1994; Geyer 1999).

The Metropolis-Hastings algorithm for sampling the conditional law
$p(\mathbf{y}|\theta)$ has a transition kernel based on three types of moves.
The first move is called birth and proposes to add a new cylinder to the
present configuration. This new cylinder can be added uniformly in $K$ or can
be randomly connected with the rest of the network. This mechanism helps to
build a connected network. The second move is called death, and proposes to
eliminate a randomly chosen cylinder. The role of this second move is to
ensure the detailed balance of the simulated Markov chain and its convergence
towards the equilibrium distribution. A third move can be added to improve the
mixing properties of the sampling algorithm. This third move is called change;
it randomly chooses a cylinder in the configuration and proposes to "slightly"
change its parameters using simple probability distributions. For specific
details concerning the implementation of these dynamics we recommend
van Lieshout & Stoica (2003) and Stoica et al. (2005a).
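The birth and death moves can be sketched with the standard Metropolis-Hastings acceptance ratios for point processes. The sketch below is deliberately reduced: it uses only uniform births and deaths proposed with equal probability, omits the connected-birth and change moves described above, and assumes a density $\exp(-U)$ taken with respect to a unit-rate Poisson process on a region of known volume. The function and parameter names are hypothetical.

```python
import math
import random

def birth_death_mh(energy, volume, draw_object, n_steps, rng):
    """Metropolis-Hastings chain with uniform birth and death moves only.

    `energy(config)` returns U(config); `draw_object(rng)` proposes a new
    object uniformly. Returns the configuration size after every step.
    """
    config = []
    u_cur = energy(config)
    sizes = []
    for _ in range(n_steps):
        proposal, ratio = None, 0.0
        if rng.random() < 0.5:                     # birth: add a uniform object
            proposal = config + [draw_object(rng)]
            ratio = volume / len(proposal)
        elif config:                               # death: remove a random object
            i = rng.randrange(len(config))
            proposal = config[:i] + config[i + 1:]
            ratio = len(config) / volume
        if proposal is not None:
            u_new = energy(proposal)
            log_accept = math.log(ratio) + u_cur - u_new
            if log_accept >= 0.0 or rng.random() < math.exp(log_accept):
                config, u_cur = proposal, u_new
        sizes.append(len(config))
    return sizes

# Sanity check on a non-interacting model: U = -n * log(beta) gives a Poisson
# process of intensity beta, so the mean number of objects is beta * volume.
beta = 5.0
sizes = birth_death_mh(lambda cfg: -len(cfg) * math.log(beta), 1.0,
                       lambda r: (r.random(), r.random(), r.random()),
                       n_steps=20000, rng=random.Random(42))
mean_size = sum(sizes[2000:]) / len(sizes[2000:])
```

For the Bisous model the `energy` argument would be the sum of the data and interaction energies; recomputing it from scratch at every step, as done here for brevity, would be wasteful in practice, where only the terms touched by the proposed cylinder need updating.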

Whenever the maximisation of the joint law $p(\mathbf{y}, \theta)$ is needed,
the previously described sampling mechanism can be integrated into a simulated
annealing algorithm. The simulated annealing algorithm is built by sampling
from $p(\mathbf{y}, \theta)^{1/T}$, while $T$ goes slowly to zero.
Stoica et al. (2005a) proved the convergence of such simulated annealing for
simulating marked point processes, when a logarithmic cooling schedule is
used. According to this result, the temperature is lowered as

$$T_n = \frac{T_0}{\log n + 1}; \qquad (10)$$

we use $T_0 = 10$ for the initial temperature.
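To get a feeling for how slowly a logarithmic schedule cools, the short sketch below tabulates the temperature at a few iteration numbers. It assumes that Eq. (10) reads $T_n = T_0/(\log n + 1)$ with the natural logarithm and that the iteration counter starts at $n = 1$; the function name is hypothetical.

```python
import math

def temperature(n, t0=10.0):
    """Logarithmic cooling schedule: T_n = T_0 / (log(n) + 1), for n >= 1."""
    return t0 / (math.log(n) + 1.0)

schedule = [temperature(n) for n in (1, 10, 100, 1000, 10**6)]
```

Under these assumptions the temperature starts at $T_0$ and is still above 0.6 after a million iterations, which illustrates the price paid for the convergence guarantee.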



2.6. Statistical inference 

One straightforward application of the simulation dynamics is the estimation
of the filamentary structure in a field of galaxies together with the
parameter estimates. These estimates are given by:

$$(\hat{\mathbf{y}}, \hat{\theta}) = \arg\max p(\mathbf{y}, \theta) = \arg\max p(\mathbf{y}|\theta)\,p(\theta) = \arg\min_{\Omega \times \Psi} \left\{ \frac{U_d(\mathbf{y}|\theta) + U_i(\mathbf{y}|\theta)}{\alpha(\theta)} + \frac{U_p(\theta)}{\alpha_p(\theta)} \right\}, \qquad (11)$$

where $\alpha(\theta)$ is the normalising constant,
$p(\theta) = \exp[-U_p(\theta)]/\alpha_p(\theta)$ is the prior law for the
model parameters and $\Psi$ is the model parameter space.

However, the solution we obtain is not unique. In practice, the shape of the
prior law $p(\theta)$ may influence the solution, making the result look more
random than a result obtained for fixed values of the parameters. Therefore,
it is reasonable to ask how precise the estimate is, that is, whether an
element of the pattern really belongs to the pattern, or whether its presence
is due to random effects (Stoica et al. 2007a,b).

For compact subregions $\mathcal{R} \subset K$, we can compute or give Monte
Carlo approximations for average quantities such as

$$E_{(\mathbf{Y}, \Theta)}\left[f(\mathcal{R}, Z(\mathbf{Y}))\right], \qquad (12)$$

where $E$ denotes the expectation value over the data and model parameter
space, and $f(\mathcal{R}, \cdot)$ is a real measurable function with respect
to the $\sigma$-algebra associated to the configuration space of the marked
point process.

If $f(\mathcal{R}, Z(\mathbf{Y})) = \mathbb{1}\{\mathcal{R} \subseteq Z(\mathbf{Y})\}$
(where $\mathbb{1}$ is the indicator function), then the expression (12)
represents the probability that the considered model includes or visits the
region $\mathcal{R}$. Furthermore, if $K$ is partitioned into a finite
collection of small disjoint cells
$\{\mathcal{R}_1, \mathcal{R}_2, \ldots, \mathcal{R}_q\}$, then a visit
probability map can be obtained. This map is given by the partition together
with the value $p_i = E[\mathbb{1}\{\mathcal{R}_i \subseteq Z(\mathbf{Y})\}]$
associated to each cell. The map is defined by the model and by the parameters
of the simulation algorithm; its resolution is given by the cell partition.
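A Monte Carlo approximation of the visit probabilities amounts to averaging the indicator over the sampled configurations. The sketch below assumes that each sample has already been reduced to the set of cell indices covered by its cylinders; how that reduction is done (for instance, testing cell centres against each cylinder) is left open, and the function name is hypothetical.

```python
def visit_probabilities(samples, n_cells):
    """Fraction of sampled configurations whose shape covers each cell.

    `samples` is a list of sets; each set holds the indices of the cells
    covered by the cylinders of one sampled configuration.
    """
    counts = [0] * n_cells
    for covered in samples:
        for cell in covered:
            counts[cell] += 1
    return [c / len(samples) for c in counts]

# Four made-up samples on a partition with four cells.
samples = [{0, 1}, {1}, {1, 3}, {0, 1}]
p = visit_probabilities(samples, n_cells=4)
```

Cells that are visited in nearly every sample mark the robust part of the detected network, while cells with low values correspond to elements that appear only occasionally.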

The sufficient statistics of the model (3) - the interaction parameters
$n_\kappa$ and $n_s$, $s = 0, 1, 2$ - describe the size of the filamentary
network and quantify the morphological properties of the network. Therefore,
they are suitable as a general characterisation of the filamentarity of a
galaxy catalogue. This makes it possible to compare the networks of different
regions and/or different catalogues. Here, we use the sufficient statistics to
characterise the real data and the mock catalogues.

The visit maps show the location and configuration of the 
filament network. Still, the detection of filaments and this verifi- 
cation test depend on the selected model. It is reasonable to ask 
if these results are obtained because the data exhibits a filamen- 
tary structure or just because of the way the model parameters 
are selected. 

The sufficient statistics can be used to build a statistical test in order to
answer the previous question. For a given data catalogue, samples of the model
are obtained, so the means of the sufficient statistics can then be computed.
The same operation, using exactly the same model parameters, can be repeated
whenever an artificial point field - or a synthetic data catalogue - is used.
If the artificial field is the realisation of a binomial point process having
the same number of points as the number of galaxies in the original data set,
the sufficient statistics are expected to have very low values - there is no
global structure in such a binomial field. If the values of the sufficient
statistics for these binomial fields were large, this would mean that the
filamentary structure is due to the parameters, not to the data. Comparing the
values obtained for the original data sets with Monte Carlo envelopes found
for artificial point fields, we can compute Monte Carlo $p$-values to test the
hypothesis of the existence of the filamentary structure in the original data
catalogue (Stoica et al. 2007a,b).
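One common way to turn such a comparison into a number is the rank-based Monte Carlo $p$-value, sketched below. This is the standard one-sided formula and is given here as an assumption; the text does not spell out the exact estimator used. The statistic values in the example are made up.

```python
def monte_carlo_p_value(observed, simulated):
    """One-sided Monte Carlo p-value: rank of the observed statistic among simulations."""
    n_extreme = sum(1 for s in simulated if s >= observed)
    return (n_extreme + 1) / (len(simulated) + 1)

# Made-up mean values of one sufficient statistic for nine binomial fields.
binomial_stat = [3, 0, 5, 2, 1, 4, 2, 0, 3]
p_structured = monte_carlo_p_value(observed=40, simulated=binomial_stat)
p_unstructured = monte_carlo_p_value(observed=3, simulated=binomial_stat)
```

With $m$ simulated fields the smallest attainable $p$-value is $1/(m+1)$, so the number of binomial realisations sets the resolution of the test.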



3. Data 

We apply our algorithms to a real data catalogue and compare 
the results with those obtained for 22 mock catalogues, specially 
generated to simulate all main features of the real data. 

3.1. Observational data 

At the moment there are two large galaxy redshift (spatial position)
catalogues that are natural candidates for a filament search. When the work
reported here was carried out a few years ago, the best available redshift
catalogue to study the morphology of the galaxy distribution was the 2 degree
Field Galaxy Redshift Survey (2dFGRS, Colless et al. 2001); the much larger
Sloan Digital Sky Survey (SDSS; see the description of its final status in
Abazajian & Sloan Digital Sky Survey 2008) was still in its first releases.
Also, at that time only the 2dFGRS had a collection of mock catalogues that
were specially generated to mimic the observed data. So this study is based on
the 2dFGRS; we shall certainly apply our algorithms to the SDSS in the future,
too.

Fig. 5. The geometry and coordinates of the data bricks. Right panel: all 2dF galaxies inside a large contiguous area of the northern
wedge are shown in blue (up to a depth of $250\,h^{-1}$ Mpc), galaxies that belong to the NGP250 sample are depicted in red. The left
panel corresponds to the volume-limited sample. Both diagrams are shown to scale; there is very little overlap (only $15.6\,h^{-1}$ Mpc in
depth) between the NGP250 and the 2dF volume-limited bricks. The coordinates are in units of $1\,h^{-1}$ Mpc.

The 2dFGRS covers two separate regions in the sky, the NGP (North Galactic
Pole) strip, and the SGP (South Galactic Pole) strip, with a total area of
about 1500 square degrees. The nominal (extinction-corrected) magnitude limit
of the 2dFGRS catalogue is $b_J = 19.45$; reliable redshifts exist for 221,414
galaxies. The effective depth for the catalogue is about $z = 0.2$ or a
comoving distance of $D = 572\,h^{-1}$ Mpc for the standard cosmological model
with $\Omega_{\rm matter} = 0.3$ and $\Omega_\Lambda =$

The 2dFGRS catalogue is a flux-limited catalogue and there- 
fore the density of galaxies decreases with distance. For a statis- 
tical analysis of such surveys, a weighting scheme that compen- 
sates for the missing galaxies at large distances has to be used. 
However, such a weighting is suitable only for specific statistical 
problems, as e.g. the calculation of correlation functions. When 
studying the local structure, such a weighting cannot be used; it 
would only amplify the shot noise. 

We can eliminate weighting by using volume-limited samples. The 2dF team has
generated these for scaling studies (see, e.g., Croton et al. 2004); they
kindly sent these samples to us. The volume-limited samples are selected in
one-magnitude intervals; we chose as our sample the one with the largest
number of galaxies, for the absolute magnitude interval
$M_{b_J} \in [-19.0, -20.0]$. The total number of galaxies in this sample is
44,713.

1 Here and below $h$ is the dimensionless Hubble constant, $H = h \cdot 100$ km s$^{-1}$ Mpc$^{-1}$.



The NGP250 sample is good for detecting filaments, as shown in our previous
paper (Stoica et al. 2007b). But this sample is magnitude-limited (not
volume-limited), therefore the number of galaxies decreases with depth,
because only galaxies with an apparent magnitude brighter than the survey
cutoff are detected. Since we can perform statistical tests only when our base
point process is a Poisson process, implying approximately constant mean
density with depth, we have to use volume-limited samples in our study.
Moreover, the mocks that have been built for the 2dFGRS are already
volume-limited, and cannot be combined into a magnitude-limited sample because
of their different depths. Thus, if we want to compare the observed filaments
with those in the mock samples, we are forced to use volume-limited samples.



The borders of the two volumes covered by the sample are 
rather complex. As our algorithm is recent, we do not yet have 
the estimates of the border effects, and we cannot correct for 
these. So we limited our analysis to the simplest volumes - 
bricks. As the southern half of the galaxy sample has a convex 
geometry (it is limited by two conical sections of different open- 
ing angles), the bricks which are possible to cut from there have 
small volumes. Thus we used only the northern data which have 
a geometry of a slice, and chose the brick of a maximum volume 
that could be cut from the slice. We will compare the results 
obtained for this sample (2dF) below with those obtained for a 
smaller sample in a previous paper (NGP250); the geometry and
galaxy content of these two data sets are described in Table 1. We
have shifted the origin of the coordinates to the near lower left
corner of the brick; the geometry of the bricks (both the 2dF and
NGP250 sample) is illustrated in Fig. 5.

sample    $N_{gal}$   depth   width   height   $d$
2dF       8487        133.1   254.0   31.1     5.0
NGP250    7588        88.6    169.1   20.7     3.4

Table 1. Galaxy content and geometry for the data bricks (sizes
are in $h^{-1}$ Mpc). $N_{gal}$ is the number of galaxies in the sample,
and $d$ is the mean distance between galaxies in the sample.
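
As a quick consistency check on Table 1, the mean distance $d$ can be reproduced from the brick dimensions. The sketch below assumes that $d$ is the cube root of the volume per galaxy; this definition is not spelled out in the text, and the function name is purely illustrative.

```python
def mean_separation(n_gal, depth, width, height):
    """Cube root of the volume per galaxy (assumed definition of d)."""
    volume = depth * width * height  # brick volume in (h^-1 Mpc)^3
    return (volume / n_gal) ** (1.0 / 3.0)


# Galaxy counts and brick sizes taken from Table 1.
d_2df = mean_separation(8487, 133.1, 254.0, 31.1)
d_ngp = mean_separation(7588, 88.6, 169.1, 20.7)
print(round(d_2df, 1), round(d_ngp, 1))
```

Under this assumption both values round to the entries listed in the last column of the table.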



3.2. Mock catalogues 

We compare the observed filaments with those built for mock 
galaxy catalogues which try to simulate the observations as 
closely as possible. The construction of these catalogues is
described in detail by Norberg et al. (2002); we give a short
summary here. The 2dF mock catalogues are based on the "Hubble
Volume" simulation (Colberg et al. 2000), an N-body simulation
of a $3\,h^{-1}$ Gpc cube of $10^9$ mass points. These mass points are
considered as galaxy candidates and are sampled according to a 
set of rules that include: 

1. Biasing: the probability for a galaxy to be selected is
calculated on the basis of the smoothed (with a $\sigma = 2\,h^{-1}$ Mpc
Gaussian filter) final density. This probability (biasing) is
exponential (rule 2 of Cole et al. 1998), with parameters chosen
to reproduce the observed power spectrum of galaxy
clustering.

2. Local structure: the observer is placed in a location similar 
to our local cosmological neighbourhood. 

3. A survey volume is selected, following the angular and dis- 
tance selection factors of the real 2dFGRS. 

4. Luminosity distribution: luminosities are assigned to galax- 
ies according to the observed (Schechter) luminosity distri- 
bution; k + e-corrections are added. 

These "ideal" catalogues are then combined with observational 
errors to produce the final mock catalogues: 

1. Galaxy redshifts are modified by adding random dynamical 
velocities. 

2. Observational random errors are added to galaxy magni- 
tudes. 

3. Based on galaxy positions, survey incompleteness factors are 
calculated. 

These catalogues are as close to the observed catalogues as cur- 
rently possible - the spatial coverage, galaxy density, clustering, 
luminosities and observational errors are the same. So, we expect 
that the filamentary structure of the mock catalogues should be 
close to the ones we observe. 



4. Filaments 

4.1. Experimental setup

As described above, we use the data sets drawn from the galaxy 
distribution in the Northern subsample of the 2dFGRS survey 
and from the 22 mock catalogues. For mock catalogues, we use 
the same absolute magnitude range and cut the same bricks as 
for the 2dFGRS survey. 

The sample region K is the brick. In order to choose the 
values for the dimensions of the cylinder we use the physical 
dimensions of the galaxy filaments that have been observed in 



more detail (Pimbblet & Drinkwater 2004); we used the same
values also in our previous paper (Stoica et al. 2007b): a radius
$r = 0.5$ and a height $h = 6.0$ (all sizes are given in $h^{-1}$ Mpc). The
radius of the cylinder is close to the minimal one can choose,
taking into account the data resolution. Its height is also close
to the shortest possible, as our shadow cylinder has to have a
cylindrical geometry, too (the ratio of its height to the diameter
is presently 3:1). We choose the attraction radius as $r_a = 0.5$,
giving the value 1.5 for the maximum distance between the
connected cylinders, and for the cosines of the maximum curvature
angles we choose $\tau_\parallel = \tau_\perp = 0.15$. This allows for a maximum of
$\approx 30^\circ$ between the direction angles of connected cylinders and
considers the cylinders to be orthogonal, if the angle between
their directions is larger than $\approx 80^\circ$.
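
To make the angular thresholds concrete, here is a small sketch. It assumes that two cylinders count as aligned when the cosine of the angle between their axes is at least $1 - \tau_\parallel$, and as orthogonal when its absolute value is at most $\tau_\perp$. This reading of the curvature parameters is an assumption chosen to be consistent with the angles quoted above, and the constant names are illustrative.

```python
import math

TAU_PARALLEL = 0.15  # assumed rule: aligned if cos(angle) >= 1 - TAU_PARALLEL
TAU_PERP = 0.15      # assumed rule: orthogonal if |cos(angle)| <= TAU_PERP

# Largest angle (degrees) still treated as "aligned" under the assumed rule.
max_aligned_angle = math.degrees(math.acos(1.0 - TAU_PARALLEL))
# Smallest angle (degrees) already treated as "orthogonal".
min_orthogonal_angle = math.degrees(math.acos(TAU_PERP))

print(f"{max_aligned_angle:.1f} {min_orthogonal_angle:.1f}")
```

Both angles come out close to the approximate values of 30 and 80 degrees given in the text.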

The model parameters $(r, h, r_a)$ influence the detection
results. If they are too low, the whole network will be considered
as made of clusters, so no filaments will be detected. If they are
too high, the detected filaments will be too wide and/or too sparse,
and precision will be lost. Still, this makes the visit maps an
interesting tool, since, in a certain manner, they average the detection
result. In this work, the $(r, h, r_a)$ parameters were fixed after a
visual inspection of the data and of different projections outlining
the filaments.

The marked point process-based methodology allows us to 
introduce these parameters as marks or priors characterised by 
a probability density, hence the detection of an optimal value 
for these parameters is then possible. Knowledge based on as- 
tronomical observations could be used to set the priors for such 
probability densities. 

For detecting the scale-length of the cylinders or for obtain- 
ing indications about its distribution, we may use visit maps to 
build cell hypothesis tests to see which the most probable h of 
the cylinder passing through this cell could be. This may require 
also a refinement of the data term of the model. 

For the data energy, we limit the parameter domain by M max = 
[-25, 20]. For the interaction energy, we choose the parameter
domain as follows: $\log\gamma_0 \in [-12.5, -7.5]$, $\log\gamma_1 \in [-5, 0]$ and
$\log\gamma_2 \in [0, 5]$. The hard repulsion parameter is set to 0, so the
configurations with repulsing cylinders are forbidden. The do- 
main of the connection parameters was chosen in a way that 
2-connected cylinders are generally encouraged, 1 -connected 
cylinders are penalised and 0-connected segments are strongly 
penalised. This choice encourages the cylinders to group in fil- 
aments in those regions where the data energy is good enough. 
Still, we have no information about the relative strength of those 
parameters. Therefore, we have decided to use the uniform law 
over the parameter domain for the prior parameter density $p(\theta)$.

4.2. Observed filaments 

We ran the simulated annealing algorithm for 250,000 iterations; 
samples were picked up every 250 steps. 

The cylinders obtained after running the simulated anneal- 
ing outline the filamentary network. But as simulated annealing 
requires an infinite number of iterations till convergence, and 
also because of the fact that an infinity of solutions is proposed 
(slightly changing the orientation of cylinders gives us another 
solution that is as good as the original one), we shall use visit 
maps to "average" the shape of the filaments. 
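
The idea of a visit map can be sketched in a few lines. The example assumes that every sampled cylinder configuration has already been rasterised into a set of grid cell indices; that representation, and the function names, are hypothetical and are not taken from the original implementation.

```python
from collections import Counter


def visit_map(samples):
    """Fraction of sampled configurations in which each cell is covered.

    `samples` is a list of sets; each set holds the (i, j, k) indices of the
    grid cells covered by cylinders in one sampled configuration.
    """
    counts = Counter()
    for cells in samples:
        counts.update(cells)  # a cell is counted at most once per sample
    n_samples = len(samples)
    return {cell: c / n_samples for cell, c in counts.items()}


def threshold(vmap, level=0.5):
    """Cells visited with a frequency higher than `level`."""
    return {cell for cell, freq in vmap.items() if freq > level}


# Toy example: three sampled configurations over a handful of cells.
samples = [
    {(0, 0, 0), (1, 0, 0)},
    {(0, 0, 0), (2, 0, 0)},
    {(0, 0, 0), (1, 0, 0)},
]
vmap = visit_map(samples)
print(sorted(threshold(vmap)))
```

Cells that appear in only a minority of the samples drop out at the 50% threshold, which is how the map averages over slightly different but equally good configurations.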

Figure 6 shows the cells that have been visited by our model
with a frequency higher than 50%, together with the galaxy field.
Filamentary structure is seen, but the filaments tend to be short,
and the network is not very well developed. For comparison, we
show a similar map for the smaller volume (NGP250), where the
galaxy density is about three times higher. We see that the
effectiveness of the algorithm depends strongly on the galaxy density;
too much of a dilution destroys the filamentary structure.

Fig. 7. Filaments in the main data set 2dF, for the rescaled basic
cylinder.

As galaxy surveys have different spatial densities, this prob- 
lem should be addressed. The obvious way to do that is to rescale 
the basic cylinder. First, we can do full parameter estimation, 
with cylinder sizes included. Second, we can use an empirical 
approach, choosing a few nearby well-studied filaments, remov- 
ing their fainter galaxies and finding the values for h and r that 
are needed to keep the filaments together. 

But this needs a separate study. We will use here a simple 
density-based rescaling: as the density of the 2dF sample is
three times lower than that of the smaller volume, we rescaled the
cylinder dimensions by $3^{1/3} = 1.44$. The filamentary network for
this case is shown in Fig. 7. This is better developed, but not as
well delineated as that for the smaller volume.

This rescaling assumes that the dilution is Poissonian, and 
there is no luminosity-density relation. Both those assumptions 
are wrong. Our justification of the scaling used here is that this 
is the simplest scaling assumption, and the dimensions of the 
rescaled cylinder (r = 0.72, h = 8.6) do not contradict observa- 
tions. Also, the filamentary networks found with rescaled cylin- 
ders and the visit maps seem to better trace the filaments seen by 
eye. We realise that the scaling problem is important, and will 
return to it in the future. 
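
The density-based rescaling is simple arithmetic and can be written out as follows. This is only a sketch of the scaling rule described above; the function name is illustrative.

```python
def rescale_cylinder(radius, height, density_ratio):
    """Scale both cylinder dimensions by the cube root of the density ratio."""
    factor = density_ratio ** (1.0 / 3.0)
    return radius * factor, height * factor, factor


# Basic cylinder (r = 0.5, h = 6.0) and a density ratio of three.
r_new, h_new, factor = rescale_cylinder(0.5, 6.0, 3.0)
print(f"factor={factor:.2f} r={r_new:.2f} h={h_new:.2f}")
```

The result agrees with the quoted rescaled dimensions up to rounding of the scale factor.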



Data          $n_t$   $n_f$   $n_e$
MOCK8 (A)     57      5       2.9
MOCK8         30      5       3.0
MOCK16 (A)    107     24      5.4
MOCK16        82      18      4.1
2dF (A)       86      13      4.3
2dF           65      14      3.3
NGP250        191     21      9.6

Table 2. Line-of-sight cylinders in the data (2dF and NGP250)
and in two mocks, 8 and 16. The index (A) labels the rescaled
case with a larger cylinder. The column $n_t$ shows the total
number of cylinders in the network, $n_f$ is the number of line-of-sight
cylinders, and $n_e$ is the expected number of line-of-sight
cylinders, in case of the isotropic cylinder orientation.



As we work in the redshift space, the apparent galaxy dis- 
tribution is distorted by peculiar velocities in groups and clus- 
ters that produce so-called 'fingers-of-god', structures that are 
elongated along the line-of-sight. These fingers may masquer- 
ade as filaments for our procedure. To estimate their influence, 
we first found the cylinders using the simulated annealing algo- 
rithm. The cylinders along the line-of-sight may be caused by 
the finger-of-god effect. A simple test was implemented,
checking if the modulus of the scalar product between the direction of
the symmetry axis of the cylinder and the direction of the
line-of-sight ($|\cos\phi|$, where $\phi$ is the angle between these directions)
is close to 1 (greater than 0.95).

The results are shown in Table 2 and in Fig. 8. The Table
shows the total number of cylinders $n_t$, the number of
line-of-sight cylinders $n_f$, and the expected number of line-of-sight
cylinders $n_e$ (assuming an isotropic distribution of cylinders,
$n_e = 0.05\,n_t$). Figure 8 compares the network of all cylinders (left
panel) with the location of the line-of-sight cylinders (right panel).
The figures for all other catalogues listed in Table 2 appear to be similar.
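
The line-of-sight test and the isotropic expectation can be sketched as below. For isotropically oriented axes $|\cos\phi|$ is uniformly distributed on [0, 1], so the fraction exceeding 0.95 is 0.05. The function names are illustrative, and the local line-of-sight direction is assumed to be supplied for each cylinder.

```python
import math


def is_line_of_sight(axis, los, threshold=0.95):
    """True if |cos(phi)| between the cylinder axis and the line of sight exceeds threshold."""
    dot = sum(a * b for a, b in zip(axis, los))
    norm = math.sqrt(sum(a * a for a in axis)) * math.sqrt(sum(b * b for b in los))
    return abs(dot) / norm > threshold


def expected_los_count(n_total, threshold=0.95):
    """Expected number of line-of-sight cylinders for isotropic orientations.

    |cos(phi)| is uniform on [0, 1] for isotropic directions, so the
    probability of exceeding the threshold is 1 - threshold.
    """
    return (1.0 - threshold) * n_total


print(is_line_of_sight((0.0, 0.1, 1.0), (0.0, 0.0, 1.0)))
print(round(expected_los_count(86), 1))
```

With the total of 86 cylinders listed for the rescaled 2dF case, the second call gives the expected count under isotropy for comparison with the observed number.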

Clearly, our method detects such fingers, although the
number of extra cylinders (fingers-of-god) is not large. There are at
least two possibilities to exclude them. The first is to use a group
catalogue, where fingers are already compressed, instead of a
pure redshift space catalogue. Another possibility is to modify
our data term, checking for the cylinder orientation, and to
eliminate the fingers within the algorithm. We will test both
possibilities in future work.

Fig. 8. The full cylinder network (left panel) and line-of-sight cylinders (right panel), for the 2dF data brick, with a rescaled cylinder.
The coordinates are in units of $1\,h^{-1}$ Mpc.



A problem that has been addressed in most of the papers 
about galaxy filaments is the typical filament length (or the 
length distribution). As our algorithm allows branching of fil- 
aments (cylinders that are approximately orthogonal), it is dif- 
ficult to separate filaments. We have tried to cut the visit maps 
into filaments, but the filaments we find this way are too short. 
We may make progress here by applying morphological operations to visit
maps. Still, this kind of operation needs at least a good mathe- 
matical understanding of the "sum" of all the cells forming the 
visit maps. 

Another possibility is to use the cylinder configurations 
for selecting individual filaments. These configurations can be 
thought of as (somewhat random) filament skeletons of visit
maps. We have used them to find the distributions of the
sufficient statistics, and these configurations should be good enough
to estimate other statistics, such as filament lengths. We will certainly
try that in the future.

There are problems where knowing the typical filament 
length is very important, as in the search for missing baryons. 
These are thought to be hidden as warm intergalactic gas
(WHIM, see, e.g., Viel et al. 2005). In order to detect this gas,
the best candidates are galaxy filaments that lie approximately 
along the line-of-sight; knowing the typical length of a filament 
we can predict if a detection would be possible. 

Concerning the length of the entire filamentary network, the 
most direct way of estimating it is to multiply the number of 
cylinders with h. In this case, the distribution of the length of 
the network is given by the distribution of the sum of the three 
sufficient statistics of the model. The precision of the estimator 
is related to the precision of h. Another possible estimator can 
be constructed using $h + r_c$ instead of $h$. The same construction
can be used even if different cylinders are used (Lacoste et al.
2005; Stoica et al. 2002, 2004). As for the sufficient statistics,
the distribution of the length of the network may be derived using 
Monte Carlo techniques. 
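
The direct estimator of the network length amounts to the following arithmetic. The sketch uses the three mean sufficient statistics listed for the 2dF brick in Table 3 and the basic cylinder height $h = 6.0$; applying the estimator to these means is only an illustration, and the function name and the `extra` argument are hypothetical.

```python
def network_length(n0, n1, n2, h, extra=0.0):
    """Total length estimate: (number of cylinders) x (h + extra)."""
    return (n0 + n1 + n2) * (h + extra)


# The sum of the three statistics is the number of cylinders, so the
# order in which the three means are passed does not matter here.
length_2df = network_length(5.30, 11.66, 1.94, 6.0)
print(round(length_2df, 1))
```

Passing a non-zero `extra` gives the alternative estimator in which a fixed length is added to $h$ for every cylinder.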

We note that we can easily find the total volumes of fila- 
ments, counting the cells on the visiting map. As an example, 
for the cases considered here, the relative filament volumes are 
$1.8 \cdot 10^{-3}$ (2dF, smaller cylinder), $3.3 \cdot 10^{-3}$ (2dF, rescaled
cylinder), and $1.6 \cdot 10^{-2}$ (NGP250).

4.3. Statistics 

As we explained before, in order to compare the filamentarity
of the observed data set (2dF) and the mocks, we had to run the
Metropolis-Hastings algorithm at a fixed temperature $T = 1.0$
(sampling from $p(y, \theta)$). The algorithm was run for 250,000
iterations, and samples were picked up every 250 iterations. The
means of the sufficient statistics of the model were computed
using these samples; the results are shown in Table 3. As an
example, we compare the single-temperature visit map for the data
with two extreme cases for the mocks (8 and 16) in Fig. 9.

Data sets   Sufficient statistics
            $\bar{n}_2$   $\bar{n}_0$   $\bar{n}_1$
2dF         1.94    5.30    11.66
MOCK 1      2.53    5.62    13.16
MOCK 2      0.48    6.20    7.52
MOCK 3      1.29    4.65    6.88
MOCK 4      1.55    9.33    15.45
MOCK 5      1.45    10.63   9.24
MOCK 6      0.38    6.21    8.96
MOCK 7      1.36    9.08    8.12
MOCK 8      0.18    6.91    4.27
MOCK 9      2.07    6.09    9.76
MOCK 10     1.62    4.40    11.91
MOCK 11     1.28    4.65    10.14
MOCK 12     2.65    7.97    11.25
MOCK 13     0.73    6.48    7.08
MOCK 14     0.36    7.30    16.44
MOCK 15     0.98    4.36    8.47
MOCK 16     2.75    11.04   22.88
MOCK 17     0.30    5.96    7.67
MOCK 18     2.15    5.11    10.44
MOCK 19     1.59    8.02    10.99
MOCK 20     1.27    8.79    10.50
MOCK 21     2.77    10.57   11.06
MOCK 22     1.79    8.10    17.26

Table 3. The mean of the sufficient statistics for the data and the
mocks: $\bar{n}_2$ is the mean number of the 2-connected cylinders, $\bar{n}_1$
is the mean number of the 1-connected cylinders and $\bar{n}_0$ is the
mean number of the 0-connected cylinders.

The MH algorithm was run at a fixed temperature. This al- 
lows us a quantitative comparison of the observed data and the 
mock catalogues through the distributions of the model sufficient 
statistics. 

The box plots of the distributions of the sufficient statistics
for the mocks and the real data are shown in Fig. 10. The box
plots for the mock catalogues are indexed from 1 to 22,
corresponding to the index of each catalogue. The box plot
corresponding to the real data is indexed 23 and it is coloured dark.

Fig. 9. Visit maps for two mocks (mock 8, left panel; mock 16, right panel) and the real data (middle panel), for the rescaled
cylinder. We show the cells for the 50% threshold.

[Figure 10: three box-plot panels, "Distribution of the n0 statistics" (left), "Distribution of the n1 statistics" (middle) and "Distribution of the n2 statistics" (right); horizontal axis: catalogue index 1 to 23.]

Fig. 10. Comparison of the distributions of the sufficient statistics for the real data (dark box plot) and the mock catalogues. Box plots
allow a simultaneous visual comparison in terms of the center, spread and overall range of the considered distributions: the middle
line of the box corresponds to $q_{0.5}$, the empirical median of the distribution, whereas the bottom and the top of the box correspond to
$q_{0.25}$ and $q_{0.75}$, the first and the third empirical quartile, respectively. The low extremal point of the vertical line (whiskers) of the box
is given by $q_{0.25} - 1.5\Delta q$, where the interquartile range $\Delta q = q_{0.75} - q_{0.25}$, and the high extremal point is given by $q_{0.75} + 1.5\Delta q$. There
is a statistical rule-of-thumb stating that the values located outside the interval given by the whiskers may be outliers. These are
shown by dots: full dots indicate "extreme" outliers, more than $3\Delta q$ away from $q_{0.25}$ or $q_{0.75}$, and open dots indicate "mild" outliers,
closer than $3\Delta q$ to these quartiles. As an example, for the Gaussian distribution the outlier region accounts for 0.7% of the total
probability.

The distributions of the $n_0$ statistics are compared in the
left panel of Fig. 10. We recall that the $n_0$ statistic
represents the number of isolated cylinders (no connections). Thus, a
large number of such cylinders tells us that the network is rather
fragmented. We see that only the mock catalogue 3 exhibits
a less fragmented network than the real data. A considerable
number of mock catalogues have a filamentary network that is
much more fragmented than the data: the median for these
catalogues is clearly much higher than the median for the real data.
Nevertheless, there are some catalogues which have similar
values as the data for the median, and also a similar shape for the
distributions. These mock catalogues are 1, 10, 11, 15 and 18. In
order to see how similar these catalogues are to the real data,
we ran a Kolmogorov-Smirnov test. The p-values for the mock
catalogues 1 and 18 were 0.96 and 0.13, respectively. For the
mock catalogues 10, 11 and 15 the obtained p-values were all
lower than 0.002. Hence, we conclude that among all mock
catalogues, there are only two that are similar to the real data with
respect to the distribution of the $n_0$ statistics. A large majority of
the mock catalogues exhibit networks that are much more
fragmented than the one in the real data.
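
For readers who want to repeat this kind of comparison, here is a minimal two-sample Kolmogorov-Smirnov sketch that uses only the standard library. The p-value comes from the asymptotic Kolmogorov series, which is an approximation (and only a rough guide for integer-valued counts with many ties); it is not necessarily the exact procedure used for the numbers quoted above. The synthetic Gaussian draws merely stand in for sampled statistics.

```python
import math
import random


def ks_two_sample(x, y):
    """Two-sample KS statistic D and asymptotic two-sided p-value."""
    xs, ys = sorted(x), sorted(y)
    n, m = len(xs), len(ys)
    i = j = 0
    d = 0.0
    # Walk through the pooled values, tracking the largest ECDF difference.
    while i < n and j < m:
        v = min(xs[i], ys[j])
        while i < n and xs[i] <= v:
            i += 1
        while j < m and ys[j] <= v:
            j += 1
        d = max(d, abs(i / n - j / m))
    lam = d * math.sqrt(n * m / (n + m))
    if lam < 0.2:
        return d, 1.0  # the series converges slowly here; p is 1 to high accuracy
    p = 2.0 * sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
                  for k in range(1, 101))
    return d, min(1.0, max(0.0, p))


random.seed(0)  # fixed seed only to make the sketch reproducible
a = [random.gauss(5.0, 2.0) for _ in range(1000)]
b = [random.gauss(5.0, 2.0) for _ in range(1000)]  # same distribution as a
c = [random.gauss(7.0, 2.0) for _ in range(1000)]  # shifted distribution

d_same, p_same = ks_two_sample(a, b)
d_diff, p_diff = ks_two_sample(a, c)
print(f"same: D={d_same:.3f}  shifted: D={d_diff:.3f}")
```

A small p-value, as obtained for the shifted sample, indicates that the two empirical distributions are unlikely to share a common parent distribution.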

The right panel in Fig. 10 compares the distributions of the
$n_2$ statistics. The $n_2$ statistic represents the number of cylinders
in a configuration which are connected at both extremities.
A configuration with a considerable number of such cylinders
forms a network made of rather long filaments. So we see that
mock catalogues 12 and 21 exhibit a network with longer
filaments than in the data. The distribution of the $n_2$ statistics for
the mock catalogue 22 appears to be very similar to the real data.
The p-value of the associated Kolmogorov-Smirnov test is 0.09.
To make a decision concerning the mock catalogues 1, 9, 16 and
18, a one-sided Kolmogorov-Smirnov test was done, based on
an alternative hypothesis that the distribution of the $n_2$ statistics
for the real data may extend farther than the distributions for
the corresponding mock catalogues. Since the obtained p-values
were all very small, we can state that the filamentary network
of these four mock catalogues is made of shorter filaments than
those in the real data. So, we conclude that with a few exceptions
the mock catalogues exhibit a network made of filaments which
are shorter than the ones in the real data.

The distributions of the $n_1$ statistics are compared in the
middle panel of Fig. 10. The $n_1$ statistic gives the number of
cylinders that are connected at only one extremity. Hence, a
configuration with a large value of this statistic is a network with
a rather high density of filaments. Looking at all three
statistics together, we get a rough idea of the topology of the
network. For instance, if both the values of $n_1$ and $n_2$ are high, this
indicates a network similar to a spaghetti plate or to a tree with
long branches. Or the other way around, if the $n_1$ and $n_0$
statistics are both high, the network should be similar to a macaroni
plate or to a bush with short branches. By this measure, the mock
catalogues 1, 4, 14, 16 and 22 produce a network which is much
more dense (or has more endpoints) than the one in the real data
catalogue. The box plots for the mock catalogues 18, 19, 20 and
21 are almost identical to the box plot for the real data,
showing that the $n_1$ distributions are similar. This was confirmed by
the Kolmogorov-Smirnov test. The mock catalogues 10 and 12
have the same median as the data, while the distributions are
much more concentrated and symmetrical. The Kolmogorov-Smirnov
test showed that these distributions differ significantly
from that for the real data. We conclude that concerning the
distributions of the $n_1$ statistics, four mock catalogues have
similar distributions to the data, and five others clearly have a much
more dense network, while the rest of the catalogues produce
networks that are clearly less dense or have fewer endpoints than
the network in the real data catalogue.

Data sets   Sufficient statistics
            $\bar{n}_2$   $\bar{n}_0$   $\bar{n}_1$
NGP250      11.31   32.76   56.15
2dF         7.13    6.72    33.43
MOCK 8      1.53    9.57    12.64
MOCK 16     6.67    12.48   37.81

Table 4. The mean of the sufficient statistics for the data and the
mocks, for the rescaled basic cylinder. The columns are the same
as in the previous table.

In summary, although for a single model characteristic the mock
catalogues may look similar to the data, taking into account all
three of them leads to a rather obvious difference between the
mocks and the observations. Generally, from a topological point
of view, the networks in the mock catalogues are more frag- 
mented and contain shorter filaments than in the data. As for the 
filament density, the mock catalogues encompass the real data, 
with a large variance. 

To see the influence of rescaling on the sufficient statistics,
we repeated the procedure with the rescaled cylinder for the data
and for the mocks 8 and 16. The results are given in Table 4.

Rescaling the basic cylinder improves the network, but not 
as much as expected - the interaction parameters remain lower 
than those obtained for the NGP250 sample. 

To see if the filamentary network we find is really hidden in
the data, we uniformly re-distributed the points inside the
domain K. Now the points follow a binomial point process that
depends only on the total number of points. For each (mock) data
set this operation was done 100 times, obtaining 100 point fields
accordingly. For each point field the method was run for
50,000 iterations at fixed $T = 1.0$, while samples were picked
up every 250 iterations. The model parameters were the same as
previously described. The mean of the sufficient statistics was
then computed. The maximum values over all 100 means for
each data set are shown in Table 5.
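
Generating such binomial reference fields is straightforward. The sketch below places a fixed number of independent, uniformly distributed points in a brick; it uses the galaxy count and dimensions of the 2dF brick from Table 1 as an example. The function name and the seed are illustrative choices.

```python
import random


def binomial_field(n_points, depth, width, height, rng):
    """n_points independent, uniformly distributed points inside the brick."""
    return [
        (rng.uniform(0.0, depth), rng.uniform(0.0, width), rng.uniform(0.0, height))
        for _ in range(n_points)
    ]


rng = random.Random(42)  # fixed seed only to make the sketch reproducible
# 100 random fields with the same number of points as the 2dF brick.
fields = [binomial_field(8487, 133.1, 254.0, 31.1, rng) for _ in range(100)]
print(len(fields), len(fields[0]))
```

Each field keeps the total number of points but destroys any spatial structure, which is what makes it a useful null case for the detection.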

As we see, the algorithm does not find any connected
cylinders for a random distribution: both the numbers of the
1-connected and 2-connected cylinders are strictly zero. Only in
a few cases do the data allow us to place a single cylinder. Thus, the
filaments our algorithm discovers in galaxy surveys and in mock 
catalogues are real, they are hidden in the data and are not the 
result of a lucky choice of the model parameters. 



Binomial data sets   Sufficient statistics
                     max $\bar{n}_2$   max $\bar{n}_0$   max $\bar{n}_1$
MOCK 1               0       0.02    0
MOCK 1               0       0.015   0
MOCK 3               0       0.01    0
MOCK 5               0       0.015   0
MOCK 6               0       0.03    0
MOCK 7               0       0.02    0
MOCK 8               0       0.015   0

Table 5. The maximum of the mean of the sufficient statistics over
binomial fields generated for some mock catalogues (the same
number of points): max $\bar{n}_2$ is the maximum mean number of the
2-connected cylinders, max $\bar{n}_1$ is the maximum mean number
of the 1-connected cylinders and max $\bar{n}_0$ is the maximum mean
number of the 0-connected cylinders.



5. Discussion, conclusions and perspectives 

In a previous paper (Stoica et al. 2007b) we developed a new
approach to locate and characterise filaments hidden in three- 
dimensional point fields. We applied it to a galaxy catalogue 
(2dFGRS), found the filaments and described their properties by 
the sufficient statistics (interaction parameters) of our model. 

As there are numerical models (mocks) that are carefully 
constructed to mimic all local properties of the 2dFGRS, we 
were interested to see whether these models also have global 
properties similar to the observed data. An obvious test for that 
is to find and compare the filamentary networks in the data and 
in the mocks. We did that, using fixed shape parameters for the 
basic building blocks for the filaments, and fixed interaction po- 
tentials. These priors had led to good results before. 

In order to strictly compare the observed catalogue and the 
mocks, we had to work with constant-density samples (volume- 
limited catalogues). This inevitably led to a smaller spatial den- 
sity, and the filament networks we recovered were not as good as 
those found in the previous paper. Rescaling the basic cylinder 
helped, but not as much as expected. 

As all the mock samples are selected from a single large- 
volume simulation, they share the same realisation of the ini- 
tial density and velocity fields. The large-scale properties of the 
density field and its filamentarity should be similar in all the 
mocks. The volumes of the mocks are sufficiently large to
suppress cosmic noise at the filament scales, and the dimensions
of the bricks are large, too, except for the third dimension, the
thickness of the brick. Our bricks are very thin, with a height of
only $31.1\,h^{-1}$ Mpc. This can cause a selection of different pieces
of dark matter filaments and consequently a broad variance in 
the filamentarity of the density field. 

The biasing scheme will also influence the properties of 
mock filaments. As the particle mass in the simulation was
$2.2 \cdot 10^{12}\,M_\odot$, galaxies had to be identified with individual mass
points, and this makes the biasing scheme pretty random (com- 
pared with later scenarios where galaxies have been built inside 
dark matter subhaloes). Another source of randomness is the ran- 
dom assignment of galaxy luminosities that excludes reproduc- 
ing the well-known luminosity-density relation. As we saw, the 
filaments in a typical mock are shorter, and that can be explained 
by the 'randomisation' of galaxy chains. 

There are several new results in our paper: 

1. The filamentarity of the real galaxy catalogue, as described
by the sufficient statistics of our model (the interaction
parameters), lies within the range covered by the mocks. But
the model filaments are, in general, much shorter and do not
form an extended network.

2. The filamentarity of the mocks themselves differs much. This 
may be caused both by the specific geometry (thin slice) of 
the sample volume and by the biasing scheme used to popu- 
late the mocks with galaxies. 

3. Finally, we compared our catalogues with the random (bino- 
mial) catalogues with the same number of data points and 
found that these do not exhibit any filamentarity at all. This 
proves that the filaments we find exist in the data. 

Our method does not yield an estimate of the precision of the 
detection. This is an important and far from trivial problem. For
instance, even in the ideal situation of an entirely supervised
detection of these filaments (made by hand by a human specialist),
we may wonder how the obtained result should be validated.
Another very important difficulty is related to the fact that there 
exists no precise (mathematical) definition of what a galactic fil- 
ament really is. 

The definition we propose, "something complicated made of 
connected small cylinders containing galaxies that are more con- 
centrated in a cylinder than outside it", can be clearly improved 
in order to allow a better local fit for a cylinder. Still, although
quite simple, this definition allows us a general treatment
using marked point processes. Very recent work in the marked point
process literature presents methodological ideas leading to
statistical model validation (Baddeley et al. 2005, 2008). This gives
hope and perspective to incorporate these ideas into our method.
This perspective is important because it allows us a global ap- 
preciation of the result. 

The detection test on realisations of binomial point processes 
shows that whenever filaments are not present in the data, the 
proposed method does not detect filaments. This also means that 
the detected filaments in the data are "true filaments" (in the 
sense of our definition) and not a "random alignment of points" 
(false alarms) that may occur by chance even in a binomial point 
process. In that sense, together with the topological information 
given by the sufficient statistics, our model is a good tool for de- 
scribing the network. The strong point of this approach is that 
it allows simultaneous morphological description and statistical 
inference. Another important advantage of using a marked point 
process based methodology is that it allows for the evolution of 
the definition of the objects forming the pattern we are looking 
for. 

One of the messages this paper communicates is that looking 
at two different families of data sets with the same statistical tool, 
we get rather different results from a statistical point of view. 
Therefore, we can safely conclude that the two families of data 
sets are different. 

There are many ways to improve on the work we have done 
so far. We have seen above that it is difficult to find the scale 
(lengths) of the filaments for our model; this problem has to 
be solved. Second, we have used fixed parameters for the data 
term (cylinder sizes); these should be found from the data. Third, 
the filament network seems to be hierarchical, with filaments of 
different widths and sizes; a good model should include this. 
Fourth, parameter estimation and detection validation should
also be included; the uniform law does not allow the
characterisation of the distribution of the model parameters, and for the
moment we cannot say that the detected filamentary pattern is
correctly detected; the only statistical statement that we can make
is that this pattern is hidden in the data and we have some good
ideas about where it can be found, but we do not give any precise
measure of it.



Also, it would be good if our model could be extended to
describe inhomogeneous point processes, i.e. magnitude-limited
catalogues that have many more galaxies and where the filaments
can be traced much better. The first rescaling attempt we made
in this paper could be a step in this direction, but as we saw, it is 
not perfect. And, as usual in astronomy - we would understand 
nature much better if we had more data. The more galaxies we 
see at a given location, the better we can trace their large-scale 
structure. 

The Bayesian framework and the theory of marked point
processes allow the mathematical formulation of filamentary
pattern detection methodologies that introduce the previously
mentioned improvements (inhomogeneity, different size of objects,
parameter estimation). The numerical implementation and the
construction of these improvements in harmony with the
astronomical observations and theoretical knowledge are open and
challenging problems.